In this paper, we describe a new method for constructing minimal, deterministic, acyclic
finite-state automata from a set of strings. Traditional methods consist of two phases: the
first to construct a trie, the second one to minimize it. Our approach is to construct a
minimal automaton in a single phase by adding new strings one by one and minimizing the
resulting automaton on-the-fly. We present a general algorithm as well as a specialization
that relies upon the lexicographical ordering of the input strings. Our method is fast and
significantly lowers memory requirements in comparison to other methods.



1 Introduction 



Finite state automata are used in a variety of applications, including aspects of natural 
language processing (NLP). They may store sets of words, with or without annotations 
such as the corresponding pronunciation, base form, or morphological categories. The 
main reasons for using finite state automata in the NLP domain are that their repre- 
sentation of the set of words is compact and that looking up a string in a dictionary 
represented by a finite-state automaton is very fast — proportional to the length of the 
string. Of particular interest to the NLP community are deterministic, acyclic, finite- 
state automata, which we call dictionaries. 

Dictionaries can be constructed in various ways; see Watson (1993a; 1995) for a
taxonomy of (general) finite state automata construction algorithms. A word is simply a 
finite sequence of symbols over some alphabet and we do not associate it with a mean- 
ing in this paper. A necessary and sufficient condition for any deterministic automaton 
to be acyclic is that it recognizes a finite set of words. The algorithms described here 
construct automata from such finite sets. 

The Myhill-Nerode theorem (see Hopcroft and Ullman (1979)) states that among the
many deterministic automata that accept a given language, there is a unique automa- 
ton (excluding isomorphisms) that has a minimal number of states. This is called the 
minimal deterministic automaton of the language. 

The generalized algorithm presented in this paper has been independently developed
by Jan Daciuk of the Technical University of Gdansk, and by Richard Watson and
Bruce Watson (then of the IST Technologies Research Group) at Ribbit Software Systems
Inc. The specialized (to sorted input data) algorithm was independently developed by 
Jan Daciuk and by Stoyan Mihov of the Bulgarian Academy of Sciences. Jan Daciuk has 
made his C++ implementations of the algorithms freely available for research purposes 
at www.pg.gda.pl/~jandac/fsa.html.¹ Stoyan Mihov has implemented the (sorted input)
algorithm in a Java package for minimal acyclic finite-state automata. This package
forms the foundation of the Grammatical Web Server for Bulgarian (at origin2000.bas.bg) 
and implements operations on acyclic finite automata, such as union, intersection and 
difference, as well as constructions for perfect hashing. Commercial C++ and Java im- 
plementations are available via www.OpenFIRE.org. The commercial implementations 
include several additional features such as a method to remove words from the dictio- 
nary (while maintaining minimality). The algorithms have been used for constructing 
dictionaries and transducers for spell checking, morphological analysis, two-level mor- 
phology, restoration of diacritics, perfect hashing, and document indexing. The algo- 
rithms have also proven useful in numerous problems outside the field of NLP, such as 
DNA sequence matching and computer virus recognition. 

An earlier version of this paper, authored by Daciuk, Watson, and Watson, appeared
at the International Workshop on Finite-state Methods in Natural Language Processing
in 1998; see Daciuk, Watson, and Watson (1998).



2 Mathematical Preliminaries 

We define a deterministic finite state automaton to be a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$,
where $Q$ is a finite set of states, $q_0 \in Q$ is the start state, $F \subseteq Q$ is a set of final
states, $\Sigma$ is a finite set of symbols called the alphabet and $\delta$ is a partial mapping
$\delta : Q \times \Sigma \to Q$ denoting transitions. When $\delta(q, a)$ is undefined, we write
$\delta(q, a) = \bot$. We can extend the $\delta$ mapping to a partial mapping
$\delta^* : Q \times \Sigma^* \to Q$ as follows (where $a \in \Sigma$, $x \in \Sigma^*$):

$$\delta^*(q, \varepsilon) = q$$

$$\delta^*(q, ax) = \begin{cases} \delta^*(\delta(q, a), x) & \text{if } \delta(q, a) \neq \bot \\ \bot & \text{otherwise} \end{cases}$$
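As a minimal sketch of the extended transition function, assume (purely for illustration) that
the partial mapping is stored as a Python dict keyed by (state, symbol) pairs, with `None`
standing in for the undefined value. The names `delta_star`, `delta` and `final` are not from
the paper.

```python
from typing import Dict, Hashable, Optional, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]  # partial mapping: a missing key means "undefined"


def delta_star(delta: Delta, q: State, word: str) -> Optional[State]:
    """Follow `word` from state `q`; return None if some transition is undefined."""
    for symbol in word:
        nxt = delta.get((q, symbol))
        if nxt is None:
            return None
        q = nxt
    return q


# Hypothetical three-state automaton accepting only "ab".
delta = {(0, "a"): 1, (1, "b"): 2}
final = {2}

print(delta_star(delta, 0, "ab") in final)  # True
print(delta_star(delta, 0, "ac"))           # None
print(delta_star(delta, 0, ""))             # 0
```

The empty word returns the starting state itself, which mirrors the base case of the definition.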



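Let DAFSA be the set of all deterministic finite state automata in which the transition
function $\delta$ is acyclic: there is no nonempty string $w$ and state $q$ such that
$\delta^*(q, w) = q$. We define $\mathcal{L}(M)$ to be the language accepted by automaton $M$:

$$\mathcal{L}(M) = \{ x \in \Sigma^* \mid \delta^*(q_0, x) \in F \}$$

The size of the automaton, $|M|$, is equal to the number of states, $|Q|$. $\mathcal{P}(\Sigma^*)$
is the set of all languages over $\Sigma$. Define the function
$\overrightarrow{\mathcal{L}} : Q \to \mathcal{P}(\Sigma^*)$ to map a state $q$ to the set of
all strings on a path from $q$ to any final state in $M$. More precisely,

$$\overrightarrow{\mathcal{L}}(q) = \{ x \in \Sigma^* \mid \delta^*(q, x) \in F \}$$

$\overrightarrow{\mathcal{L}}(q)$ is called the right language of $q$. Note that
$\mathcal{L}(M) = \overrightarrow{\mathcal{L}}(q_0)$. The right language of a
state can also be defined recursively (where $a \cdot L = \{ ay \mid y \in L \}$):

$$\overrightarrow{\mathcal{L}}(q) = \bigcup_{a \in \Sigma,\ \delta(q, a) \neq \bot} a \cdot \overrightarrow{\mathcal{L}}(\delta(q, a)) \;\cup\; \begin{cases} \{\varepsilon\} & \text{if } q \in F \\ \emptyset & \text{otherwise} \end{cases}$$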
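The recursive definition of the right language translates almost directly into code. The
following is a small sketch, assuming transitions are stored as a nested dict (state to
{symbol: target}) and that the automaton is acyclic so the recursion terminates; the names are
illustrative only.

```python
from typing import Dict, FrozenSet, Set


def right_language(q: int, delta: Dict[int, Dict[str, int]], final: Set[int]) -> FrozenSet[str]:
    """Right language of q: {epsilon} if q is final, plus a . L(delta(q, a)) for each a."""
    result = {""} if q in final else set()
    for symbol, target in delta.get(q, {}).items():
        result |= {symbol + tail for tail in right_language(target, delta, final)}
    return frozenset(result)


# Hypothetical automaton for {"ab", "ac", "b"}; states are small integers.
delta = {0: {"a": 1, "b": 2}, 1: {"b": 2, "c": 2}}
final = {2}

print(sorted(right_language(0, delta, final)))  # ['ab', 'ac', 'b']
print(sorted(right_language(1, delta, final)))  # ['b', 'c']
```

Called on the start state, the function returns the language of the whole automaton.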



1 The algorithms in Daciuk's implementation differ slightly from those presented here, as he uses automata 
with final transitions, not final states. Such automata have fewer states and fewer transitions than 
traditional ones. 
One may ask whether such a recursive definition has a unique solution. Most texts on
language theory, for example Arbib, Moll and Kfoury (1988), show that the solution is
indeed unique: it is the least fixed-point of the equation.

We also define a property of an automaton specifying that all states can be reached
from the start state:

$$\mathit{Reachable}(M) \equiv \forall_{q \in Q}\, \exists_{x \in \Sigma^*}\, (\delta^*(q_0, x) = q)$$

The property of being a minimal automaton is traditionally defined as follows (see Watson
(1993b; 1995)):

$$\mathit{Min}(M) \equiv \forall_{M' \in \mathrm{DAFSA}}\, (\mathcal{L}(M) = \mathcal{L}(M') \Rightarrow |M| \leq |M'|)$$

We will, however, use an alternative definition of minimality, which is shown to be
equivalent:

$$\mathit{Minimal}(M) \equiv (\forall_{q, q' \in Q}\, (q \neq q' \Rightarrow \overrightarrow{\mathcal{L}}(q) \neq \overrightarrow{\mathcal{L}}(q'))) \wedge \mathit{Reachable}(M)$$

A general treatment of automata minimization can be found in Watson (1995). A formal
proof of the correctness of the following algorithm can be found in Mihov (1998).
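For small automata, the alternative definition of minimality can be checked by brute force:
verify reachability, then compare right languages pairwise. The sketch below assumes a nested
dict representation of the transitions and is meant only to illustrate the definition, not as a
practical test; all names are hypothetical.

```python
from typing import Dict, FrozenSet, Set

Delta = Dict[int, Dict[str, int]]


def right_language(q: int, delta: Delta, final: Set[int]) -> FrozenSet[str]:
    result = {""} if q in final else set()
    for symbol, target in delta.get(q, {}).items():
        result |= {symbol + tail for tail in right_language(target, delta, final)}
    return frozenset(result)


def is_minimal(states: Set[int], delta: Delta, final: Set[int], q0: int) -> bool:
    # Reachable(M): every state must be reached from the start state.
    seen, stack = {q0}, [q0]
    while stack:
        for target in delta.get(stack.pop(), {}).values():
            if target not in seen:
                seen.add(target)
                stack.append(target)
    if seen != states:
        return False
    # Distinct states must have distinct right languages.
    languages = {right_language(q, delta, final) for q in states}
    return len(languages) == len(states)


# Trie for {"ab", "cb"}: states 1 and 2 (and 3 and 4) share a right language.
trie = ({0, 1, 2, 3, 4}, {0: {"a": 1, "c": 2}, 1: {"b": 3}, 2: {"b": 4}}, {3, 4}, 0)
# The equivalent automaton with the duplicates merged.
merged = ({0, 1, 2}, {0: {"a": 1, "c": 1}, 1: {"b": 2}}, {2}, 0)

print(is_minimal(*trie))    # False
print(is_minimal(*merged))  # True
```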

3 Construction from Sorted Data 
Figure 2

The unique minimal dictionary whose language is the French regular endings of verbs of the 
first group. 



A trie is a dictionary with a tree-structured transition graph in which the start state is
the root and all leaves are final states.² An example of a dictionary in the form of a trie is
given in Figure 1. We can see that many subtrees in the transition graph are isomorphic.
The equivalent minimal dictionary (Figure 2) is the one in which only one copy of each
isomorphic subtree is kept. This means that pointers (edges) to all isomorphic subtrees
are replaced by pointers (edges) to their unique representative.

The traditional method to obtain a minimal dictionary is to first create a (not necessarily
minimal) dictionary for the language and then minimize it using any one of a number of
algorithms (again, see Watson (1993b; 1995) for numerous examples of such algorithms).
The first stage is usually done by building a trie, for which there are fast and well
understood algorithms. Dictionary minimization algorithms are quite efficient in terms of
the size of their input dictionary; for some algorithms, the memory and time requirements
are both linear in the number of states. Unfortunately, even such good performance is not
sufficient in practice, where the intermediate dictionary (the trie) can be much larger than
the available physical memory. (Some effort towards decreasing the memory requirement has
been made; see Revuz (1991).) This paper presents a way to reduce these intermediate memory
requirements and decrease the total construction time by constructing the minimal dictionary
incrementally (word by word, maintaining an invariant of minimality), thus avoiding ever
having the entire trie in memory.

2 There may also be non-leaf, in other words interior, nodes which are final.

The central part of most automata minimization algorithms is a classification of states.
The states of the input dictionary are partitioned such that the equivalence classes
correspond to the states of the equivalent minimal automaton. Assuming the input dictionary
has only reachable states (that is, Reachable is true), we can deduce (by our alternative
definition of minimality) that each state in the minimal dictionary must have a unique
right language. Since this is a necessary and sufficient condition for minimality, we can
use equality of right languages as the equivalence relation for our classes. Using our
definition of right languages, it is easily shown that equality of right languages is an
equivalence relation (it is reflexive, symmetric and transitive). We will denote two states,
$p$ and $q$, belonging to the same equivalence class by $p \equiv q$ (note that $\equiv$ here
is different from its use for logical equivalence of predicates). In the literature, this
relation is sometimes written as $E$.

To aid in understanding, let us traverse the trie (see Figure 1) with the postorder
method and see how the partitioning can be performed. For each state we encounter, 
we must check whether there is an equivalent state in the part of the dictionary that has 
already been analyzed. If so, we replace the current state with the equivalent state. If not, 
we put the state into a register, so that we can find it easily. It follows that the register 
has the following property: it contains only states which are pairwise inequivalent. We 
start with the (lexicographically) first leaf, moving backward through the trie toward the 
start state. All states up to the first forward-branching state (state with more than one 
outgoing transition) must belong to different classes and we immediately place them 
in the register, since there will be no need to replace them by other states. Considering 
the other branches, and starting from their leaves, we need to know whether or not 
a given state belongs to the same class as a previously registered state. For a given 
state p (not in the register), we try to find a state q in the register that would have the 
same right language. To do this, we do not need to compare the languages themselves 
— comparing sets of strings is computationally expensive. We can use our recursive 
definition of the right language. State p belongs to the same class as q if and only if: 

1. they are either both final or both non-final; and 

2. they have the same number of outgoing transitions; and 

3. corresponding outgoing transitions have the same labels; and 

4. corresponding outgoing transitions lead to states that have the same right 
languages. 

Because the postorder method ensures that all states reachable from the states already 
visited are unique representatives of their classes (i.e. their right languages are unique 
in the visited part of the automaton), we can rewrite the last condition as: 

4' corresponding transitions lead to the same states. 

If all the conditions are satisfied, the state p is replaced by q. Replacing p simply involves 
deleting it while redirecting all of its incoming transitions to q. Note that all leaf states 
belong to the same equivalence class. If some of the conditions are not satisfied, p must 
be a representative of a new class and therefore must be put into the register. 
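The postorder procedure described above can be sketched as follows. This is an illustration of
the traditional two-phase approach (build a trie, then minimize it with a register), assuming a
hash-based register keyed on conditions 1 to 3 together with condition 4'. The class and
function names, and the `uid` field used to identify target states, are assumptions made for
the example.

```python
import itertools
from typing import Dict, List, Tuple

_uids = itertools.count()


class State:
    def __init__(self) -> None:
        self.uid = next(_uids)          # stable identity used inside signatures
        self.final = False
        self.edges: Dict[str, "State"] = {}


def build_trie(words: List[str]) -> State:
    root = State()
    for word in words:
        node = root
        for symbol in word:
            node = node.edges.setdefault(symbol, State())
        node.final = True
    return root


def signature(state: State) -> Tuple:
    # Finality, transition labels, and the identity of the target states (4').
    return (state.final, tuple(sorted((a, t.uid) for a, t in state.edges.items())))


def minimize(state: State, register: Dict[Tuple, State]) -> State:
    """Postorder: make the children canonical first, then replace or register `state`."""
    for symbol in sorted(state.edges):
        state.edges[symbol] = minimize(state.edges[symbol], register)
    return register.setdefault(signature(state), state)


def count_states(root: State) -> int:
    seen, stack = {root.uid}, [root]
    while stack:
        for target in stack.pop().edges.values():
            if target.uid not in seen:
                seen.add(target.uid)
                stack.append(target)
    return len(seen)


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
print(count_states(trie))      # 8
minimal = minimize(trie, {})
print(count_states(minimal))   # 5
```

Comparing target identities instead of whole right languages is valid here only because the
children have already been replaced by their registered representatives.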

To build the dictionary one word at a time, we need to merge the process of adding 
new words to the dictionary with the minimization process. There are two crucial ques- 
tions that need to be answered. First, which states (or equivalence classes) are subject 
to change when new words are added? Second, is there a way to add new words to the 
dictionary such that we minimize the number of states that may need to be changed 
during the addition of a word? Looking at Figures 1 and 2, we can reproduce the same






Computational Linguistics 



Volume 26, Number 1 



postorder traversal of states when the input data is lexicographically sorted. (Note that
in order to do this, the alphabet $\Sigma$ must be ordered, as is the case with ASCII and
Unicode.) To process a state, we need to know its right language. According to the method
presented above, we must have the whole subtree whose root is that state. The subtree 
represents endings of subsequent (ordered) words. Further investigation reveals that 
when we add words in this order, only the states that need to be traversed to accept the 
previous word added to the dictionary may change when a new word is added. The 
rest of the dictionary remains unchanged, because a new word either 

• begins with a symbol different from the first symbols of all words already in 
the automaton; the beginning symbol of the new word is lexicographically 
placed after those symbols; or 

• it shares some (or even all) initial symbols of the word previously added to the 
dictionary; the algorithm then creates a forward branch, as the symbol on the 
label of the transition must be later in the alphabet than symbols on all other 
transitions leaving that state. 

When the previous word is a prefix of the new word, the only state that is to be modified
is the last state belonging to the previous word. The new word may share its ending with
other words already in the dictionary, which means that we need to create links to some
parts of the dictionary. Those parts, however, are not modified. This discovery leads us
to Algorithm 1, shown below.



Algorithm 1

Register := ∅;
do there is another word →
    Word := next word in lexicographic order;
    CommonPrefix := common_prefix(Word);
    LastState := δ*(q0, CommonPrefix);
    CurrentSuffix := Word[length(CommonPrefix)+1 ... length(Word)];
    if has_children(LastState) →
        replace_or_register(LastState)
    fi;
    add_suffix(LastState, CurrentSuffix)
od;
replace_or_register(q0)

func common_prefix(Word) →
    return the longest prefix w of Word such that δ*(q0, w) ≠ ⊥
cnuf

func replace_or_register(State) →
    Child := last_child(State);
    if has_children(Child) →
        replace_or_register(Child)
    fi;
    if ∃ q ∈ Q (q ∈ Register ∧ q ≡ Child) →
        last_child(State) := q : (q ∈ Register ∧ q ≡ Child);
        delete(Child)
    else
        Register := Register ∪ {Child}
    fi
cnuf



The main loop of the algorithm reads subsequent words and establishes which part
of the word is already in the automaton (the CommonPrefix), and which is not (the
CurrentSuffix). An important step is determining what the last state (here called LastState)
in the path of the common prefix is. If LastState already has children, it means that not
all states in the path of the previously added word are in the path of the common prefix. In
that case, by calling the function replace_or_register, we can let the minimization process
work on those states in the path of the previously added word that are not in the common
prefix path. Then we can add to the LastState a chain of states that would recognize
the CurrentSuffix.

The function common_prefix finds the longest prefix (of the word to be added) that is
a prefix of a word already in the automaton. The prefix can be empty (since
$\delta^*(q, \varepsilon) = q$).

The function add_suffix creates a branch extending out of the dictionary which represents
the suffix of the word being added (the maximal suffix of the word which is
not a prefix of any other word already in the dictionary). The last state of this branch is
marked as final.

The function last_child returns a reference to the state reached by the lexicographically
last transition that is outgoing from the argument state. Since the input data is
lexicographically sorted, last_child returns the outgoing transition (from the state) most
recently added (during the addition of the previous word). The function replace_or_register
effectively works on the last child of the argument state. It is called with the argument
that is the last state in the common prefix path (or the initial state in the last call). We
need the argument state to modify its transition in those instances in which the child
is to be replaced with another (equivalent) state. Firstly, the function calls itself
recursively until it reaches the end of the path of the previously added word. Note that when
it encounters a state with more than one child, it takes the last one, as it belongs to the
previously added word. As the length of words is limited, so is the depth of recursion.
Then, returning from each recursive call, it checks whether a state equivalent to the
current state can be found in the register. If this is true, then the state is replaced with
the equivalent state found in the register. If not, the state is registered as a representative
of a new class. Note that the function replace_or_register processes only the states
belonging to the path of the previously added word (a part, or possibly all of those created
with the previous call to add_suffix), and that those states are never reprocessed. Finally,
has_children returns true if, and only if, there are outgoing transitions from the state.
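The sorted-input construction described above might be sketched in Python as follows. This is
an illustrative reading of Algorithm 1 under some assumptions: states are plain objects, the
register is a dict keyed by a signature (finality plus labelled target identities), and the
input is sorted by code point without duplicates. The names `State`, `build_sorted`, `accepts`
and `count_states` are hypothetical.

```python
import itertools
from typing import Dict, Iterable, Tuple

_uids = itertools.count()


class State:
    def __init__(self) -> None:
        self.uid = next(_uids)
        self.final = False
        self.edges: Dict[str, "State"] = {}

    def signature(self) -> Tuple:
        return (self.final, tuple(sorted((a, t.uid) for a, t in self.edges.items())))


def replace_or_register(state: State, register: Dict[Tuple, State]) -> None:
    symbol = max(state.edges)           # last_child: lexicographically last transition
    child = state.edges[symbol]
    if child.edges:                     # has_children(child)
        replace_or_register(child, register)
    # setdefault returns an equivalent registered state, or registers `child` itself.
    state.edges[symbol] = register.setdefault(child.signature(), child)


def build_sorted(words: Iterable[str]) -> State:
    root, register = State(), {}
    for word in words:
        last, i = root, 0               # common_prefix: follow existing transitions
        while i < len(word) and word[i] in last.edges:
            last, i = last.edges[word[i]], i + 1
        if last.edges:
            replace_or_register(last, register)
        for symbol in word[i:]:         # add_suffix: append a fresh chain of states
            last.edges[symbol] = State()
            last = last.edges[symbol]
        last.final = True
    if root.edges:
        replace_or_register(root, register)
    return root


def accepts(root: State, word: str) -> bool:
    state = root
    for symbol in word:
        if symbol not in state.edges:
            return False
        state = state.edges[symbol]
    return state.final


def count_states(root: State) -> int:
    seen, stack = {root.uid}, [root]
    while stack:
        for target in stack.pop().edges.values():
            if target.uid not in seen:
                seen.add(target.uid)
                stack.append(target)
    return len(seen)


words = sorted({"tap", "taps", "top", "tops"})
root = build_sorted(words)
print(count_states(root))                     # 5
print(all(accepts(root, w) for w in words))   # True
print(accepts(root, "to"))                    # False
```

At no point does the sketch hold the full trie: only the path of the most recently added word
remains unminimized.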

During the construction, the automaton states are either in the register or on the
path for the last added word. All the states in the register are states in the resulting
minimal automaton. Hence the temporary automaton built during the construction has
fewer states than the resulting automaton plus the length of the longest word. Memory
is needed for the minimized dictionary that is under construction, the call stack, and
for the register of states. The memory for the dictionary is proportional to the number
of states and the total number of transitions. The memory for the register of states is
proportional to the number of states and can be freed once construction is complete. By
choosing an appropriate implementation method one can achieve a memory complexity
$O(n)$ for a given alphabet, where $n$ is the number of states of the minimized automaton.
This is an important advantage of our algorithm.
For each letter from the input list, the algorithm either has to make a step in the
function common_prefix or add a state in the procedure add_suffix. Both operations can
be performed in constant time. Each new state that has been added in the procedure
add_suffix has to be processed exactly once in the procedure replace_or_register. The
number of states that have to be replaced or registered is clearly smaller than the number
of letters in the input list.³ The processing of one state in the procedure consists of one
register search and possibly one register insertion. The time complexity of the search is
$O(\log n)$, where $n$ is the number of states in the (minimized) dictionary. The time
complexity of adding a state to the register is also $O(\log n)$. In practice, however, by
using a hash table to represent the register (and equivalence relation), the average time
complexity of those operations can be made almost constant. Hence the time complexity of
the whole algorithm is $O(l \log n)$, where $l$ is the total number of letters in the input list.
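As a rough sketch of how this bound is assembled (the symbol $T$ is introduced here only for
illustration): each of the $l$ input letters triggers one constant-time step in common_prefix
or add_suffix, and at most $l$ states each incur one register search and at most one insertion
at $O(\log n)$ apiece, so

$$T(l, n) = l \cdot O(1) + l \cdot O(\log n) = O(l \log n).$$

If the register is a hash table whose operations take expected constant time for a fixed
alphabet, the same accounting would suggest an expected total close to $O(l)$; this is an
inference from the counts above rather than a bound stated for a specific implementation.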

4 Construction from Unsorted Data 

Sometimes it is difficult or even impossible to sort the input data before constructing a 
dictionary. For example, there may be insufficient time or storage space to sort the data 
or the data originates in another program or physical source. An incremental dictionary- 
building algorithm would still be very useful in those situations, although unsorted 
data makes it more difficult to merge the trie-building and the minimization processes. 
We could leave the two processes disjoint, although this would lead to the traditional 
method of constructing a trie and minimizing it afterwards. A better solution is to min- 
imize everything on-the-fly, possibly changing the equivalence classes of some states 
each time a word is added. Before actually constructing a new state in the dictionary, 
we first determine if it would be included in the equivalence class of a preexisting state. 
Similarly, we may need to change the equivalence classes of previously constructed 
states since their right languages may have changed. This leads to an incremental con- 
struction algorithm. Naturally, we would want to create the states for a new word in an 
order that would minimize the creation of new equivalence classes. 

As in the algorithm for sorted data, when a new word w is added, we search for 
the prefix of w already in the dictionary. This time, however, we cannot assume that 
the states traversed by this common prefix will not be changed by the addition of the 
word. If there are any preexisting states traversed by the common prefix that are already
targets of more than one in-transition (known as confluence states), then blindly
appending another transition to the last state in this path (as we would in the sorted
algorithm) would accidentally add more words than desired (see Figure 3 for an example
of this).

3 The exact number of the states that are processed in the procedure replace_or_register is
equal to the number of states in the trie for the input language.

Figure 3

The result of blindly adding the word bae to a minimized dictionary (appearing on the left)
containing abd and bad. The rightmost dictionary inadvertently contains abe as well. The lower
dictionary is correct; state 3 had to be cloned.

To avoid generation of such spurious words, all states in the common prefix path
from the first confluence state must be cloned. Cloning is the process of creating a new
state that has outgoing transitions on the same labels and to the same destination states
as a given state. If we compare the minimal dictionary (Figure 2) to an equivalent trie
(Figure 1), we notice that a confluence state can be seen as a root of several original,
isomorphic subtrees merged into one (as described in the previous section). One of the
isomorphic subtrees now needs to be modified (leaving it no longer isomorphic), so
it must first be separated from the others by cloning its root. The isomorphic subtrees
hanging off these roots are unchanged, so the original root and its clone have the same
outgoing transitions (that is, transitions on the same labels and to the same destination
states).
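The effect of a confluence state, and of cloning it, can be reproduced on the three words used
in Figure 3. The sketch below builds the minimal dictionary for abd and bad by hand, then adds
bae twice: once by blindly extending the shared state, and once after cloning it. The `State`
class and helper names are assumptions made for the example.

```python
from typing import Dict, Set


class State:
    def __init__(self, final: bool = False) -> None:
        self.final = final
        self.edges: Dict[str, "State"] = {}


def clone(state: State) -> State:
    """New state with the same finality, labels, and destination states."""
    copy = State(state.final)
    copy.edges = dict(state.edges)
    return copy


def language(state: State, prefix: str = "") -> Set[str]:
    words = {prefix} if state.final else set()
    for symbol, target in state.edges.items():
        words |= language(target, prefix + symbol)
    return words


def build_abd_bad():
    """Minimal dictionary for {abd, bad}; `shared` is the confluence state."""
    root, after_a, after_b, shared, end = State(), State(), State(), State(), State(True)
    root.edges = {"a": after_a, "b": after_b}
    after_a.edges = {"b": shared}
    after_b.edges = {"a": shared}
    shared.edges = {"d": end}
    return root, after_b, shared, end


# Blind append: extending the confluence state also changes the path through "ab".
root, after_b, shared, end = build_abd_bad()
shared.edges["e"] = end
print(sorted(language(root)))   # ['abd', 'abe', 'bad', 'bae']

# Cloning first: only the path through "ba" sees the new transition.
root, after_b, shared, end = build_abd_bad()
copy = clone(shared)
after_b.edges["a"] = copy
copy.edges["e"] = end
print(sorted(language(root)))   # ['abd', 'bad', 'bae']
```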

In Algorithm 1, the confluence states were never traversed during the search for
the common prefix. The common prefix was not only the longest common prefix of the
word to be added and all the words already in the automaton. It was also the longest
common prefix of the word to be added and the last (i.e. the previous) word added to the
automaton. As it was the function replace_or_register that created confluence states, and
that function was never called on states belonging to the path of the last word added to
the automaton, those states could never be found in the common prefix path.

Once the entire common prefix is traversed, the rest of the word must be appended. 
If there are no confluence states in the common prefix, then the method of adding the 
rest of the word does not differ from the method used in the algorithm for sorted data. 
However, we need to withdraw (from the register) the last state in the common prefix 
path in order not to create cycles. This is in contrast to the situation in the algorithm for 
sorted data where that state is not yet registered. Also, CurrentSuffix could be matched 
with a path in the automaton containing states from the common prefix path (including 
the last state of the prefix). 

When there is a confluence state, then we need to clone some states. We start with 
the last state in the common prefix path, append the rest of the word to that clone and 
minimize it. Note that in this algorithm, we do not wait for the next word to come, so 
we can minimize (replace or register the states of) CurrentSuffix state by state as they 
are created. Adding and minimizing the rest of the word may create new confluence 
states earlier in the common prefix path, so we need to rescan the common prefix path 
in order not to create cycles, as illustrated in Figure 4. Then we proceed with cloning
and minimizing the states on the path from the state immediately preceding the last 
state to the current first confluence state. 

Figure 4

Consider an automaton (shown in solid lines on the left) accepting abcde and fghde. Suppose we
want to add fghdghde. As the common prefix path (shown in thicker lines) contains a confluence
state, we clone state 5 to obtain state 9, add the suffix to state 9, and minimize it. When we also
consider the dashed lines in the left figure, we see that state 8 became a new confluence state
earlier in the common prefix path. The right figure shows what could happen if we did not rescan
the common prefix path for confluence states. State 10 is a clone of state 4.

Another, less complicated but also less economical, method can be used to avoid the
problem of creating cycles in the presence of confluence states. In that solution, we proceed
from the state immediately preceding the confluence state towards the end of the common
prefix path, cloning the states on the way. But first, the state immediately preceding
the first confluence state should be removed from the register. At the end of the common
prefix path, we add the suffix. Then, we call replace_or_register with the predecessor
of the state immediately preceding the first confluence state. The following should be
noted about this solution:

• memory requirements are higher, as we keep more than one isomorphic state
at a time,

• the function replace_or_register must remain recursive (as in the sorted version),
and

• the argument to replace_or_register must be a string, not a symbol, in order to
pass subsequent symbols to children.
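A simplified construction for unsorted input, in the spirit of the less economical method just
described, can be sketched as follows. It is not a transcription of the paper's algorithm: it
withdraws every unshared state on the common prefix path from the register, clones every state
from the first confluence state onward, appends the suffix, and then replaces or registers the
whole path from its end back to the start state. The `Dictionary` class, the `incoming` counter
and the `uid` field are assumptions made for the example.

```python
import itertools
from typing import Dict, List, Set, Tuple

_uids = itertools.count()


class State:
    def __init__(self) -> None:
        self.uid = next(_uids)
        self.final = False
        self.incoming = 0   # number of in-transitions; > 1 marks a confluence state
        self.edges: Dict[str, "State"] = {}

    def signature(self) -> Tuple:
        return (self.final, tuple(sorted((a, t.uid) for a, t in self.edges.items())))


class Dictionary:
    def __init__(self) -> None:
        self.root = State()
        self.register: Dict[Tuple, State] = {}

    def _link(self, source: State, symbol: str, target: State) -> None:
        old = source.edges.get(symbol)
        if old is not None:
            old.incoming -= 1
        source.edges[symbol] = target
        target.incoming += 1

    def add(self, word: str) -> None:
        path: List[State] = [self.root]
        cloning, i = False, 0
        while i < len(word) and word[i] in path[-1].edges:   # common prefix
            child = path[-1].edges[word[i]]
            cloning = cloning or child.incoming > 1
            if cloning:                                      # clone shared states
                copy = State()
                copy.final = child.final
                for symbol, target in child.edges.items():
                    self._link(copy, symbol, target)
                self._link(path[-1], word[i], copy)
                child = copy
            else:                                            # about to change: withdraw
                del self.register[child.signature()]
            path.append(child)
            i += 1
        for symbol in word[i:]:                              # add the suffix
            self._link(path[-1], symbol, State())
            path.append(path[-1].edges[symbol])
        path[-1].final = True
        for j in range(len(word), 0, -1):                    # replace or register
            parent, child = path[j - 1], path[j]
            twin = self.register.setdefault(child.signature(), child)
            if twin is not child:
                for target in child.edges.values():          # child is discarded
                    target.incoming -= 1
                self._link(parent, word[j - 1], twin)

    def words(self) -> Set[str]:
        found, stack = set(), [(self.root, "")]
        while stack:
            state, prefix = stack.pop()
            if state.final:
                found.add(prefix)
            stack.extend((t, prefix + a) for a, t in state.edges.items())
        return found

    def size(self) -> int:
        seen, stack = {self.root.uid}, [self.root]
        while stack:
            for target in stack.pop().edges.values():
                if target.uid not in seen:
                    seen.add(target.uid)
                    stack.append(target)
        return len(seen)


d = Dictionary()
for w in ["abd", "bad", "bae"]:
    d.add(w)
print(sorted(d.words()), d.size())   # ['abd', 'bad', 'bae'] 6
d.add("abe")
print(sorted(d.words()), d.size())   # ['abd', 'abe', 'bad', 'bae'] 5
```

The usage lines replay the Figure 3 example: adding abe to the correct dictionary for abd, bad
and bae yields an automaton that is one state smaller.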

When the process of traversing the common prefix (up to a confluence state) and 
adding the suffix is complete, further modifications follow. We must recalculate the 
equivalence class of each state on the path of the new word. If any equivalence class 
changes, we must also recalculate the equivalence classes of all of the parents of all of 
the states in the changed class. Interestingly, this process could actually make the new 
dictionary smaller. For example, if we add the word abe to the dictionary at the bottom
of Figure 3 while maintaining minimality, we obtain the dictionary shown on the right
of Figure 3, which is one state smaller. The resulting algorithm is shown in Algorithm 2.



Algorithm 2

Register := ∅;
do there is another word →
    Word := next word;
    CommonPrefix := common_prefix(Word);
    CurrentSuffix := Word[length(CommonPrefix)+1 ... length(Word)];
    if CurrentSuffix = ε ∧ δ*(q0, CommonPrefix) ∈ F →
        continue
    fi;
    FirstState := first_state(CommonPrefix);
    if FirstState = ∅ →
        LastState := δ*(q0, CommonPrefix)
    else
        LastState := clone(δ*(q0, CommonPrefix))
    fi;
    add_suffix(LastState, CurrentSuffix);
    if FirstState ≠ ∅ →
        FirstState := first_state(CommonPrefix);
        CurrentIndex := (length(x) : δ*(q0, x) = FirstState);
        for i from length(CommonPrefix) − 1 downto CurrentIndex →
            CurrentState := clone(δ*(q0, CommonPrefix[1 ... i]));
            δ(CurrentState, CommonPrefix[i]) := LastState;
            replace_or_register(CurrentState, Word[i+1]);
            LastState := CurrentState
        rof
    else
        CurrentIndex := length(CommonPrefix)
    fi;
    Changed := true;
    do Changed →
        CurrentIndex := CurrentIndex − 1;
        CurrentState := δ*(q0, Word[1 ... CurrentIndex]);
        OldState := LastState;
        if CurrentIndex > 0 →
            Register := Register − {LastState}
        fi;
        replace_or_register(CurrentState, Word[CurrentIndex+1]);
        LastState := δ(CurrentState, Word[CurrentIndex+1]);
        Changed := OldState ≠ LastState;
        LastState := CurrentState
    od
    if ¬Changed ∧ CurrentIndex > 0 →
        Register := Register ∪ {CurrentState}
    fi
od

func replace_or_register(State, Symbol) →
    Child := δ(State, Symbol);
    if ∃ q ∈ Q (q ∈ Register ∧ q ≡ Child) →
        delete(Child);
        last_child(State) := q : (q ∈ Register ∧ q ≡ Child)
    else
        Register := Register ∪ {Child}
    fi
cnuf



The main loop reads the words, finds the common prefix, and tries to find the first
confluence state in the common prefix path. Then the remaining part of the word
(CurrentSuffix) is added.

If a confluence state is found (i.e. FirstState points to a state in the automaton), all
states from the first confluence state to the end of the common prefix path are cloned,
and then considered for replacement or registering. Note that the inner loop (with i as
the control variable) begins with the penultimate state in the common prefix, because
the last state has already been cloned and the function replace_or_register acts on a child
of its argument state.
Addition of a new suffix to the last state in the common prefix changes the right
languages of all states that precede that state in the common prefix path. The last part 
of the main loop deals with that situation. If the change resulted in such modification 
of the right language of a state that an equivalent state can be found somewhere else 
in the automaton, then the state is replaced with the equivalent one and the change 
propagates towards the initial state. If the replacement of a given state cannot take place, 
then (according to our recursive definition of the right language) there is no need to 
replace any preceding state. 

Several changes to the functions used in the sorted algorithm are necessary to handle
the general case of unsorted data. The replace_or_register procedure needs to be modified
slightly. Since new words are added in arbitrary order, one can no longer assume
that the last child (lexicographically) of the state (the one that has been added most
recently) is the child whose equivalence class may have changed. However, we know
the label on the transition leading to the altered child, so we use it to access that state.
Also, we do not need to call the function recursively. We assume that add_suffix replaces
or registers the states in the CurrentSuffix in correct order; later we process one path of
states in the automaton, starting from those most distant from the initial state, proceeding
towards the initial state $q_0$. So in every situation in which we call replace_or_register,
all children of the state Child are already unique representatives of their equivalence
classes.

Also, in the sorted algorithm, add_suffix is never passed $\varepsilon$ as an argument, whereas
this may occur in the unsorted version of the algorithm. The effect is that the LastState
should be marked as final since the common prefix is, in fact, the entire word. In the
sorted algorithm, the chain of states created by add_suffix was left for further treatment
until new words are added (or until the end of processing). Here, the automaton is
completely minimized on the fly after adding a new word, and the function add_suffix
can call replace_or_register for each state it creates (starting from the end of the suffix).
Finally, the new function first_state simply traverses the dictionary using the given word
prefix and returns the first confluence state it encounters. If no such state exists, first_state
returns $\emptyset$.

As in the sorted case, the main loop of the unsorted algorithm executes $m$ times,
where $m$ is the number of words accepted by the dictionary. The inner loops are executed
at most $|w|$ times for each word $w$. Putting a state into the register takes $O(\log n)$,
although it may be constant when using a hash table. The same estimation is valid for a
removal from the register. In this case, the time complexity of the algorithm remains the
same, but the constant changes. Similarly, hashing can be used to provide an efficient
method of determining the state equivalence classes. For sorted data, only a single path
through the dictionary could possibly be changed each time a new word is added. For
unsorted data, however, the changes frequently fan out and percolate all the way back
to the start state, so processing each word takes more time.

4.1 Extending the algorithms 

These new algorithms can also be used to construct transducers. The alphabet of the
(transducing) automaton would be $\Sigma_1 \times \Sigma_2$, where $\Sigma_1$ and $\Sigma_2$ are
the alphabets of the levels. Alternatively, elements of $\Sigma_2^*$ can be associated with the
final states of the dictionary and only output once a valid word from $\Sigma_1^*$ is recognized.

5 Related work 



An algorithm described by Revuz (1991) also constructs a dictionary from sorted data
while performing a partial minimization on-the-fly. Data is sorted in reverse order and
that property is used to compress the endings of words within the dictionary as it
is being built. This is called a pseudo-minimization and must be supplemented by a
true minimization phase afterwards. The minimization phase still involves finding an
equivalence relation over all of the states of the pseudo-minimal dictionary. It is possible
to use unsorted data but it produces a much bigger dictionary in the first stage of
processing. However, the time complexity of the minimization can be reduced somewhat
by using knowledge of the pseudo-minimization process. Although this pseudo-minimization
technique is more economical in its use of memory than traditional techniques, we are
still left with a sub-minimal dictionary which can be 8 times larger than the equivalent
minimal dictionary (Revuz (1991, page 33), reporting on the DELAF dictionary).



Recently a semi-incremental algorithm was described by Watson (1998b) at the Workshop
on Implementing Automata. That algorithm requires the words to be sorted in any
order of decreasing length (this sorting process can be done in linear time), and takes
advantage of similar automata properties to those presented in this paper. In addition, the
algorithm requires a final minimization phase after all words have been added. For this
reason, it is only semi-incremental and does not maintain full minimality while adding
words, although it usually maintains the automata close enough to minimality for
practical applications.

6 Conclusions 

We have presented two new methods for incrementally constructing a minimal, deterministic,
acyclic finite state automaton from a finite set of words (possibly with corresponding
annotations). Their main advantage is their minimal intermediate memory
requirements.⁴ The total construction time of these minimal dictionaries is dramatically
reduced compared with previous algorithms. The algorithm constructing a dictionary from
sorted data can be used in parallel with other algorithms that traverse or utilize the
dictionary since parts of the dictionary that are already constructed are no longer subject
to future change.



4 It is minimal in asymptotic terms; naturally compact data structures can also be used. 